Skip to content

refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223

Open
rapsealk wants to merge 5 commits into
mainfrom
perf/11220-sysfs-first-stats
Open

refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223
rapsealk wants to merge 5 commits into
mainfrom
perf/11220-sysfs-first-stats

Conversation

@rapsealk
Copy link
Copy Markdown
Member

@rapsealk rapsealk commented Apr 22, 2026

Closes #11220 (BA-5860)
Refs #11216

Summary

  • Default CPU/memory stat collection to StatModes.CGROUP on native Linux (sysfs path, ~100-500 ms → <10 ms per sample).
  • Keep StatModes.DOCKER as the forced path on linuxkit / Docker Desktop and as a per-read fallback when a cgroup sysfs read fails.
  • Dedup "cgroup sysfs failed, falling back to Docker API" warnings per (plugin, container_id) via a bounded cachetools.LRUCache so a systemic sysfs failure doesn't storm logs.
  • Replace raise RuntimeError("should not reach here") in exhaustive match defaults with typing.assert_never (satisfies the agent rule banning built-in exceptions in business logic + strengthens mypy exhaustiveness).
  • No changes to network/IO stat collection.

Operator note — possible metric step-down on upgrade

Hosts previously running stats-type: docker may see:

  • Memory: reported usage steps down by the inactive-file cache amount (sysfs now matches docker stats CLI semantics).
  • Block I/O (cgroup v1 only): readings may shift because sysfs reads blkio.throttle.io_service_bytes while the Docker API path reads blkio_stats.io_service_bytes_recursive.

Dashboards and autoscaling thresholds tuned to the old values should be re-evaluated.

Why

Agent-runtime performance epic #11216; sysfs is an order of magnitude faster for CPU/memory stats than the Docker API round-trip.

Test plan

  • pants test tests/unit/agent:: passes (CPU + Memory plugin fallback, linuxkit guard, warn dedup)
  • On a native Linux host, confirm CPU/memory stats populate via cgroup path (no Docker API calls for those fields)
  • Simulate a cgroup read failure; confirm the Docker-API fallback triggers and the one-time warning fires

On Linux native hosts, the sysfs (cgroup) stat collection path is
strictly faster (<10 ms) than the Docker API path (100-500 ms), yet
the previous default selected the slow path for most cases.

- Change the `container.stats-type` default from `docker` to unset so
  that the agent auto-selects the fastest available path (`cgroup` on
  native Linux, `docker` elsewhere).
- Wrap the cgroup CPU and memory readers with a per-read Docker API
  fallback: when a sysfs read returns None (missing path or unreadable),
  emit a rate-limited warning and fall back to the Docker API instead
  of silently dropping the sample.
- Force the Docker API path on linuxkit / Docker Desktop hosts even
  when `StatModes.CGROUP` is selected, since cgroup paths are not
  reliably accessible there.

Network and block I/O stats continue to use the Docker API path.
`StatModes.DOCKER` remains available as an explicit config override.

Closes #11220

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk added this to the 26.5 milestone Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 04:32
@github-actions github-actions Bot added size:L 100~500 LoC comp:agent Related to Agent component labels Apr 22, 2026
@rapsealk rapsealk changed the title enhance(agent): default to sysfs-first CPU/memory stats on native Linux refactor(agent): Default to sysfs-first CPU/memory stats on native Linux Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the agent’s default container stats collection strategy so that native Linux hosts prefer the faster cgroup/sysfs path for CPU/memory, while keeping Docker API usage for linuxkit/Docker Desktop and as a runtime fallback when cgroup reads fail.

Changes:

  • Switch stats_type default from docker to “auto” (unset/None) so StatContext selects the preferred mode per host.
  • Add per-container cgroup→Docker-API fallback behavior (and one-time warning) for CPU and memory stats collection; force API on linuxkit.
  • Add unit tests covering CPU fallback-on-failure and linuxkit forcing API.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
tests/unit/agent/test_docker_intrinsic.py Adds CPUPlugin unit tests for cgroup sysfs failure fallback and linuxkit forcing API.
src/ai/backend/agent/docker/intrinsic.py Implements cgroup-first fallback logic for CPU/Memory and linuxkit forcing API; adds warning suppression helper.
src/ai/backend/agent/config/unified.py Changes stats_type default to None (auto-select) and updates help text/validator typing.
changes/11220.enhance.md Adds changelog entry for the new default behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +76 to +90
# Tracks containers for which we have already logged a cgroup->Docker-API
# fallback warning, to avoid log spam on persistent read failures.
_cgroup_fallback_warned: set[str] = set()


def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None:
key = f"{plugin}:{container_id}"
if key in _cgroup_fallback_warned:
return
_cgroup_fallback_warned.add(key)
log.warning(
"{0}: cgroup sysfs read failed for container {1}; falling back to Docker API",
plugin,
container_id[:7],
)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_cgroup_fallback_warned is a process-global set that only grows and is never pruned. On agents with high container churn this can become an unbounded memory sink over time. Consider bounding it (e.g., LRU/TTL cache) or clearing entries when containers are removed / after some time window.

Copilot uses AI. Check for mistakes.
Comment on lines +81 to +91
def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None:
key = f"{plugin}:{container_id}"
if key in _cgroup_fallback_warned:
return
_cgroup_fallback_warned.add(key)
log.warning(
"{0}: cgroup sysfs read failed for container {1}; falling back to Docker API",
plugin,
container_id[:7],
)

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_warn_cgroup_fallback_once() is meant to prevent log spam on persistent sysfs failures, but sysfs_impl() already logs a warning on every OSError before returning None. With the new fallback, persistent failures will still spam the existing sysfs warning (and add one more warning once). Consider moving the per-container warning solely to the fallback layer (or otherwise suppressing/reducing repeated sysfs warnings) so the anti-spam behavior is effective.

Copilot uses AI. Check for mistakes.
Comment on lines 1344 to +1349
"Method for collecting container resource statistics. "
"'docker' uses Docker's stats API (most compatible). "
"'cgroup' reads from cgroup filesystem directly (more accurate, requires root). "
"'null' disables statistics collection."
"'cgroup' reads from cgroup filesystem directly (faster and more accurate, "
"requires root on Linux). "
"When unset, the agent auto-selects: cgroup on native Linux hosts "
"(with Docker API fallback on read failure) and docker elsewhere."
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says there are no changes to network/IO stat collection, but switching the default stats_type to auto-select cgroup on native Linux will cause MemoryPlugin.gather_container_measures() to use its sysfs implementation by default (which collects io/net via cgroup + /proc), rather than Docker stats API. If the intent is to keep network/IO coming from the Docker stats API by default, the mode selection likely needs to be split (CPU/memory via cgroup, net/IO via API) or the PR description should be updated to reflect the behavior change.

Copilot uses AI. Check for mistakes.
Comment on lines +851 to +866
async def cgroup_first_impl(
container_id: str,
) -> tuple[int, int, int, int, int, int, int] | None:
result = await sysfs_impl(container_id)
if result is None:
_warn_cgroup_fallback_once("MemoryPlugin", container_id)
return await api_impl(container_id)
return result

match ctx.mode:
case StatModes.CGROUP if not _is_linuxkit(self.local_config):
impl = cgroup_first_impl
case StatModes.CGROUP | StatModes.DOCKER:
impl = api_impl
case _:
raise RuntimeError("should not reach here")
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemoryPlugin now has a cgroup-first fallback-to-API path (and linuxkit forcing API) similar to CPUPlugin, but the new behavior isn't covered by unit tests (unlike CPU where new tests were added). Please add tests that (1) verify MemoryPlugin falls back to fetch_api_stats() when sysfs_impl() returns None and (2) verify linuxkit forces API even when ctx.mode is CGROUP.

Copilot uses AI. Check for mistakes.
rapsealk and others added 2 commits April 22, 2026 14:11
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extend changelog to call out the memory metric shift (sysfs
  excludes inactive file cache; dashboards may step-down).
- Bound the _cgroup_fallback_warned set to prevent unbounded growth
  on hosts with high container churn.
- Replace RuntimeError("should not reach here") with assert_never()
  to satisfy the BackendAIError-only rule and strengthen exhaustive
  match-statement type checking.
- Add parallel MemoryPlugin fallback + linuxkit tests and a
  dedup-warn test.

Refs #11220
Refs #11223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk
Copy link
Copy Markdown
Member Author

Review summary from an independent pass + resolutions pushed to this branch (commit 9cc2b7e71).

Verdict: ship with changes — addressed.

Finding Resolution
Memory metric will step-down on upgrade (sysfs excludes inactive-file cache; matches docker stats CLI) — not called out for operators changes/11223.enhance.md extended to warn that dashboards and autoscaling thresholds tuned to the previous stats-type: docker values may need re-tuning
_cgroup_fallback_warned: set[str] grew unbounded on container churn Replaced with cachetools.LRUCache(maxsize=1024) — same pattern used elsewhere in the agent
raise RuntimeError("should not reach here") at two match ctx.mode: sites violates the BackendAIError-only rule Both sites now use assert_never(ctx.mode), which also strengthens mypy exhaustiveness
No mirror tests for MemoryPlugin fallback / linuxkit guard; no dedup-warn tests Added 4 tests: mirror CPU fallback + linuxkit tests for MemoryPlugin, plus TestWarnCgroupFallbackOnce covering dedup and LRU eviction

Cgroup v1/v2 both handled (pre-existing). Network/IO still goes through Docker API per spec. Selection via StatModes.get_preferred_mode() correctly picks up is_containerized() — agents running inside a container stay on DOCKER mode. Cgroup-first wraps sysfs with a per-read Docker-API fallback, so transient sysfs errors downgrade gracefully rather than silently zeroing metrics. Linuxkit guard belt-and-suspenders against Docker Desktop / colima.

_is_linuxkit() wrapped a `==` comparison in bool(...), which is a
no-op since == already returns bool.

Refs #11220
Refs #11223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extend changes/11223.enhance.md with a short note that block-I/O
  readings may also step down on cgroup v1 hosts (the sysfs path reads
  blkio.throttle.io_service_bytes; the API path reads
  io_service_bytes_recursive).
- Drop TestWarnCgroupFallbackOnce::test_evicts_beyond_limit — it was
  re-verifying cachetools.LRUCache eviction, not project logic.
  test_deduplicates_per_container already covers the project contract.

Refs #11220
Refs #11223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rapsealk added a commit that referenced this pull request Apr 22, 2026
- Add TODO(#11223) notes on CPU/Memory api_impl paths: after sysfs-first
  lands, the Docker stream on these plugins is only needed for network
  and blkio (which neither consumes), so the stream should migrate to
  a network/IO consumer.
- Add TODO(#11232) at the two per-plugin DockerStatsStreamer instantiation
  sites flagging the shared-streamer refactor.
- Add a test covering the DockerError 404 branch of _read_stream: reader
  exits cleanly on container-gone without spinning retries.

Refs #11219
Refs #11223
Refs #11224
Refs #11232

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk changed the title refactor(agent): Default to sysfs-first CPU/memory stats on native Linux refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default to sysfs-first CPU/memory stats on Linux

2 participants